{ "nbformat": 4, "nbformat_minor": 4, "metadata": { "title": "Shuffle vs. Resample", "kernelspec": { "name": "ir", "display_name": "R", "language": "R" }, "language_info": { "name": "R", "codemirror_mode": "r", "pygments_lexer": "r", "mimetype": "text/x-r-source", "file_extension": ".r", "version": "4.0.3" } }, "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "I created this notebook to help us play around with the concepts of shuffle and resample." ] }, { "cell_type": "code", "metadata": { "trusted": false }, "execution_count": 58, "source": [ "# This code will load the R packages we will use\n", "\n", "suppressPackageStartupMessages({\n", " library(mosaic)\n", " library(supernova)\n", " library(ggpubr)\n", " #library(Lock5withR)\n", " #library(fivethirtyeight)\n", "})\n", "\n", "font_size = function (size) {\n", " theme(text = element_text(size = size))}\n", "\n", "\n", "middle <- function(x, prop = .95) {\n", " sorted <- sort(x)\n", " tail_size <- (1 - prop) / 2\n", " upper_cut <- sorted[floor(length(x) * (1 - tail_size))]\n", " lower_cut <- sorted[ceiling(length(x) * tail_size)]\n", " ((x <= upper_cut) + (x > lower_cut)) == 2\n", "}\n", "\n", "\n", "# Adjust the plots to have nice sizes\n", "options(repr.plot.width = 6, repr.plot.height = 4)\n", "\n", "" ], "outputs": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## An Extreme Data Set\n", "\n", "It may be easier to appreciate what `shuffle()` is doing with a more extreme example rather than a more realistic data set!\n", "\n", "Imagine two groups, A and B. All cases in group A have an outcome of 0 and all those in B have an outcome of 10. For now let's assume there are 50 cases in each group (but you can play around and set `n` to anything you want)." ] }, { "cell_type": "code", "metadata": { "trusted": false }, "execution_count": 40, "source": [ "n <- 50\n", "\n", "group <- c(rep(\"A\", n), rep(\"B\",n))\n", "outcome <- c(rep(0, n), rep(10,n))\n", "extremedata <- data.frame(outcome, group)\n", "\n", "head(extremedata, 2*n)" ], "outputs": [ { "output_type": "execute_result", "execution_count": 40, "metadata": {}, "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
outcomegroup
<dbl><chr>
10A
20A
30A
40A
50A
60A
70A
80A
90A
100A
110A
120A
130A
140A
150A
160A
170A
180A
190A
200A
210A
220A
230A
240A
250A
260A
270A
280A
290A
300A
7110B
7210B
7310B
7410B
7510B
7610B
7710B
7810B
7910B
8010B
8110B
8210B
8310B
8410B
8510B
8610B
8710B
8810B
8910B
9010B
9110B
9210B
9310B
9410B
9510B
9610B
9710B
9810B
9910B
10010B
" ] } } ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linking _randomness_ and `shuffle()`\n", "\n", "We're trying to link together \"randomly generated data\" to `shuffle()` to having 0 difference between groups (represented as $\\beta_1 = 0$ and ultimately $Y_i = \\beta_0 + \\epsilon_i$). Let's take a look at why.\n", "\n", "Our groups start off very extreme. Group perfectly predicts outcome. If you knew a case was in group A, you would predict they would have an outcome of 0 and there would be no error.\n", "\n", "Indeed, if you look at the `favstats` you'll see that the means are 0 and 10 (and there is no spread at all within groups). In this case, the data have a very large $b_1$! That is, $b_1 = 10$!" ] }, { "cell_type": "code", "metadata": { "trusted": false }, "execution_count": 72, "source": [ "gf_histogram(~ outcome, data = extremedata) %>%\n", "gf_facet_grid(group ~ .) %>%\n", "gf_vline(xintercept = ~mean, data = favstats(outcome ~ group, data = extremedata))\n", "\n", "favstats(outcome ~ group, data = extremedata)\n", "\n", "print(\"This is the sample's b1:\")\n", "b1(outcome ~ group, data = extremedata)" ], "outputs": [ { "output_type": "execute_result", "execution_count": 72, "metadata": {}, "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
groupminQ1medianQ3maxmeansdnmissing
<chr><dbl><dbl><dbl><dbl><dbl><dbl><dbl><int><int>
A0000000500
B1010101010100500
" ] } }, { "output_type": "stream", "name": "stdout", "text": "[1] \"This is the sample's b1:\"\n" }, { "output_type": "execute_result", "execution_count": 72, "metadata": {}, "data": { "text/html": [ "10" ] } }, { "output_type": "display_data", "metadata": {}, "data": { "image/png": [ "" ] } } ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What would happen if we shuffled these groups? (I'll color the shuffled data in `dodgerblue`.)" ] }, { "cell_type": "code", "metadata": { "trusted": false }, "execution_count": 73, "source": [ "gf_histogram(~ shuffle(outcome), data = extremedata, fill = \"dodgerblue\") %>%\n", "gf_facet_grid(group ~ .) " ], "outputs": [ { "output_type": "display_data", "metadata": {}, "data": { "image/png": [ "" ] } } ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that because the groups are randomly generated, A has an equal chance of being associated with a high or low score (and so does B). So now take a look at the shuffled $b_1$! Observe how small these shuffled $b_1$s are. (Run the code below a few times.)" ] }, { "cell_type": "code", "metadata": { "trusted": false }, "execution_count": 71, "source": [ "extremedata$shuff_outcome <- shuffle(extremedata$outcome)\n", "\n", "gf_histogram(~ shuff_outcome, data = extremedata, fill = \"dodgerblue\") %>%\n", "gf_facet_grid(group ~ .) %>%\n", "gf_vline(xintercept = ~mean, data = favstats(shuff_outcome ~ group, data = extremedata))\n", "\n", "print(\"This is a shuffled b1:\")\n", "b1(shuff_outcome ~ group, data = extremedata)" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": "[1] \"This is a shuffled b1:\"\n" }, { "output_type": "execute_result", "execution_count": 71, "metadata": {}, "data": { "text/html": [ "1.6" ] } }, { "output_type": "display_data", "metadata": {}, "data": { "image/png": [ "" ] } } ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The shuffled $b_1$s end up being smaller because the randomized groups are very similar to each other now! So we characterize the **random process (of shuffle)** as a DGP where $\\beta_1=0$, the true difference is 0. And now, many of the shuffled data look very much like they could have been generated by the empty model (the grand mean of the whole data set -- depicted in `blue`)." ] }, { "cell_type": "code", "metadata": { "trusted": false }, "execution_count": 70, "source": [ "extremedata$shuff_outcome <- shuffle(extremedata$outcome)\n", "\n", "gf_histogram(~ shuff_outcome, data = extremedata, fill = \"dodgerblue\") %>%\n", "gf_facet_grid(group ~ .) %>%\n", "gf_vline(xintercept = ~mean, data = favstats(shuff_outcome ~ group, data = extremedata)) %>%\n", "gf_vline(xintercept = ~mean, data = favstats(~shuff_outcome, data = extremedata), color = \"blue\")\n", "" ], "outputs": [ { "output_type": "display_data", "metadata": {}, "data": { "image/png": [ "" ] } } ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Linking _DGP is basically just like our data_ and `resample()` (resampling our cases)\n", "\n", "We're also trying to link together \"DGP is like our data\" to `resample()` to saying as $\\beta_1 = \\b_1 = 10$ and ultimately $Y_i = \\beta_0 + 10X_i + \\epsilon_i$). Let's take a look at why.\n", "\n", "When we resample (depicted in `tomato`), we get a very similar plot to our original data." ] }, { "cell_type": "code", "metadata": { "trusted": false }, "execution_count": 69, "source": [ "gf_histogram(~ outcome, data = resample(extremedata), fill = \"tomato\") %>%\n", "gf_facet_grid(group ~ .) " ], "outputs": [ { "output_type": "display_data", "metadata": {}, "data": { "image/png": [ "" ] } } ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To make it a bit easier to see, here we've put the original extreme data (in gray), the resampled data (in `tomato`), and shuffled data (in `dodgerblue`) next to each other." ] }, { "cell_type": "code", "metadata": { "trusted": false }, "execution_count": 76, "source": [ "originalplot <- gf_histogram(~ outcome, data = extremedata) %>%\n", "gf_facet_grid(group ~ .) \n", "\n", "resampleplot <- gf_histogram(~ outcome, data = resample(extremedata), fill = \"tomato\") %>%\n", "gf_facet_grid(group ~ .) \n", "\n", "shuffleplot <- gf_histogram(~ shuff_outcome, data = extremedata, fill = \"dodgerblue\") %>%\n", "gf_facet_grid(group ~ .) \n", "\n", "ggarrange(originalplot, resampleplot, shuffleplot, \n", " labels = c(\"Original\", \"Resample\", \"Shuffle\"),\n", " ncol = 3)" ], "outputs": [ { "output_type": "display_data", "metadata": {}, "data": { "image/png": [ "" ] } } ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that because the groups are made up of resampled cases, resampled group A is still made up of cases that used to be in the original group A (and the same with group B). So now take a look at the resampled $b_1$! Indeed, if you run the code a few times, the resampled $b_1$s are the same as our original $b_1$." ] }, { "cell_type": "code", "metadata": { "trusted": false }, "execution_count": 75, "source": [ "resampledata <- resample(extremedata)\n", "\n", "gf_histogram(~ outcome, data = resampledata, fill = \"tomato\") %>%\n", "gf_facet_grid(group ~ .) %>%\n", "gf_vline(xintercept = ~mean, data = favstats(outcome ~ group, data = resampledata))\n", "\n", "print(\"This is a resampled b1:\")\n", "b1(outcome ~ group, data = resampledata)\n", "" ], "outputs": [ { "output_type": "stream", "name": "stdout", "text": "[1] \"This is a resampled b1:\"\n" }, { "output_type": "execute_result", "execution_count": 75, "metadata": {}, "data": { "text/html": [ "10" ] } }, { "output_type": "display_data", "metadata": {}, "data": { "image/png": [ "" ] } } ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So to recap resample, we characterize the **DGP like our sample** as a DGP where $\\beta_1=b_1$, the true difference is the same as that of our sample (10). And now, many of the resampled data look very much like they could have been generated by the best fitting complex model model (the two group means).\n", "\n", "## A note about wording\n", "\n", "For our purposes, it's easier to call it `shuffle()` versus `resample()` because those are the R functions we use. In data science and statistics, they call `shuffle()`!number(0)**RANDOMIZATION** and they call `resample()`!number(0)**BOOTSTRAPPING**." ] } ] }